The Siesta method for ab initio order-iV materials simulation 
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We have developed and implemented a self-consistent density functional method using standard 
norm-conserving pseudopotentials and a flexible, numerical LCAO basis set, which includes multiple- 
zeta and polarization orbitals. Exchange and correlation are treated with the local spin density or 
generalized gradient approximations. The basis functions and the electron density are projected on 
a real-space grid, in order to calculate the Hartree and exchange-correlation potentials and matrix 
elements, with a number of operations that scales linearly with the size of the system. We use a 
modified energy functional, whose minimization produces orthogonal wavefunctions and the same 
energy and density as the Kohn-Sham energy functional, without the need of an explicit orthogonal- 
ization. Additionally, using localized Wannier-like electron wavefunctions allows the computation 
time and memory, required to minimize the energy, to also scale linearly with the size of the system. 
Forces and stresses are also calculated efficiently and accurately, thus allowing structural relaxation 
and molecular dynamics simulations. 
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I. INTRODUCTION 



As the improvements in computer hardware and soft- 
ware allow the simulation of molecules and materials 
with an increasing number of atoms N, the use of so- 
called order-A^ algorithms, in which the computer time 
and memory scales linearly with the simulated system 
size, becomes increasingly important. These O(N) meth- 
ods were developed during the 1970's and 80's foij-|long 
range forcesEI and empirical interatomic potentialscl but 
only in the last 5-10 years for the much more complex 
quantum-mechanical methods, in which atomic forces are 
obtainecLby solving the interaction of ions and electrons 
togetheiu. Even among quantum mechanical methods, 
there are very different levels of approximation: empir- 
ical or semiempirical|-|Orthogonal tight-binding methods 
are the simplest onesuB; 'ab-initio' nonorthogonal-tight- 
binding and Jionself-consistent Harris-functional meth- 
ods are nextffil; and fully self-consistent density func- 
tional theory (DFT) methods are the most complex 
and reliablea. The implementation of O(N) methods in 
quantum-mechanical simulations has also followed these 
steps, with several methods already well established 
within the tight-binding formalismEl, but much less so 
in self-consistent DFTlJ. The latter also require, in addi- 
tion to solving Schrodinger's equation, the determina- 
tion of the self-consistent Hamiltonian in 0(N) itera- 
tions. While this is difficult using plane waves, a local- 



ized basis set appears to be the natural choice. One pro- 
posed approach are the 'blips' of Hernandez and GillanEj, 
regularly-spaced Gaussian-like splines that can be sys- 
tematically increased, in the spirit of finite-element meth- 
ods, although at a considerable computational cost. 

We have developed a fully self-consistent DFT, based 
on a flexible linear combination of atomic orbitals 
(LCAO) basis set, with essentially perfect 0{N) scal- 
ing. It allows extremely fast simulations using mini- 
mal basis sets and very accurate calculations with com- 
plete multiple-zeta and polarized bases, depending on the 
required accuracy and available computational power. 
In previous papersEjilj we have described preliminary 
versions of this method, that we call Siesta (Spanish 
Initiative for Electronic Simulations with Thousands of 
Atoms). There is also a reviewed of the tens of stud- 
ies performed with it, in a wide variety of systems, like 
metallic surfaces, nanotubes, and biomolecules. In this 
work we present a more complete description of the 
method, as well as some important improvements. 

Apart from that of Born and Oppenheimer, the most 
basic approximations concern the treatment of exchange 
and correlation, and the use of pseudopotentials. Ex- 
change and—correlation (XC) are treated within Kohn- 
Sham DFTtfl. Wa-allow for both the local (spin) den- 
sity approximation!!^ -tLDA/LSD) or the generalized gra- 
dient approximationtj (GGiU._-jWe use standard norm- 
conspying pseudopotentialsElEj in their fully non-local 
forrrJHj. y^ e a i so include scalar-relativistic effects and the 
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nonlinear partial-core-correction to treat exchange and 
correlation in the core regioncj. 

The Siesta code has been already tested and applied 
to dozens of systems and a variety of propertiesEij. There- 
fore, we will just illustrate here the convergence of a 
few characteristic magnitudes of silicon, the architypi- 
cal system of the field, with respect to the main preci- 
sion parameters that characterize our method: basis size 
(number of atomic basis orbitals); basis range (radius of 
the basis orbitals); fineness of the real-space integration 
grid; and confinement radius of the Wannier-like electron 
states. Other parameters, like the fc-sampling integration 
grid, are common to all similar methods and we will not 
discuss their convergence here. 



II. PSEUDOPOTENTIAL 

Although the use of pseudopotentials is not strictly 
necessary with atomic basis sets, we find them con- 
venient to get rid of the core electrons and, more 
importantly, to allow for the expansion of a smooth 
(pseudo)charge density on a uniform spatial grid. The 
theory and usaffft of first principles norm-conserving 
pseudopotentialstH is already well established. Siesta 
reads them in semilocal form (a different radial potential 
Vi(r) for each angular momentum I, optionally gener- 
ated scalar-relativisticalljOEj) from a data file that users 
can fill with their preferred choice. _We generally use the 
Troullier-Martins parameterizatiorc^l We transform this 
semilocal form into the fully non-local form proposed by 
Kleinman and Bylander (KB)Ej: 
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where SVi(r) = Vj(r) - V local {r)- xLn( r ) = 
Xto S ( r )^m(r) (with Y; m (r) a spherical harmonic) are the 
KB projection functions 



xrW = ^(r) W „(r). 



(4) 



The functions ipi n are obtained from the eigenstates tj)i n 
of the semilocal pseudopotential (screened by the pseudo- 
valence charge density) at energy ei n .using the orthogo- 
nalization scheme proposed by BlochO: 

^in(r) = ipin(r) - <Pin>{r) - — rH77^T7 — ~ ( 5 ) 



< (pin'\5Vi{r)\ipin> > 



1 d 2 1(1+1) 



Vi(r) 
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V H and V xc are the Hartree and XC potentials for the 
pseudo-valence charge density, and we are using atomic 
units (e = K = m e = 1) throughout all of this paper. 

The local part of the pseudopotential Vi oca i(r) is in 
principle arbitrary, but it must join the semilocal poten- 
tials Vi(r) which, by construction, all become equal to the 
(unscreened) all-electron potential beyond the pseudopo- 
tential core radius r core . Thus, 5Vi(r) — for r > r core . 
Ramer and Rappe havejaroposed that V/ oca /(r) be opti- 
mized for transferability^], but most plane wave schemes 
make it equal to one of the Vi(r)'s for reasons of effi- 
ciency. Our case is different because Vi oca i (r) is the only 
pseudopotential part that needs to be represented in the 
real space grid, while the matrix elements of the non- 
local part Vkb are cheaply and accurately calculated by 
two-center integrals. Therefore, we optimize Vi oca i(r) for 
smoothness, making it equal to the potential created by 
a positive charge distribution of the formEj 



p Locai (r) cx exp[-(sinh(a6r)/sinh(6))^ 



(7) 



V (r) + V xc (r)]^ ln (r) = e, n ^ /n (r) (6) 



where a and b are chosen to provide simultane- 
ously optimaLreal-space localization and reciprocal-space 
convergencecll. After some numerical tests we have taken 
b =1 and a =1.82/r core . Figure |l] shows V/ oca /(r) for sil- 
icon. 

Since Vj(r) = Vi oca i{r) outside r core , Xh^i 7 ") is strictly 
zero beyond that radius, irrespective of the value of e;„c3. 
Generally it is sufficient to have a single projector \fm 
for each angular momentum (i.e. a single term in the 
sum on n). In this case we follow the normal practice 
of making e/„ equal to the valence atomic eigenvalue e; , 
and the function <fii(r) in Eq. ^ is identical to the cor- 
responding eigenstate ipii. r )- I n some cases, particularly 
for alkaline metals, alkaline earths, and transition met- 
als of the first few columns, we have sometimes found 
it necessary to include the semicore states together with 
the valence statesE3. In these cases, we also include two 
independent KB projectors, one for the semicore and one 
for the valence states. However, our pseudopoitentials are 
still norm-conserving rather than "ultrasoft"E3. This is 
because, in our case, it is only the electron density that 
needs to be accurately represented in a real-space grid, 
rather than each wavefunction. Therefore, the ultrasoft 
pseudopotential formalism does not imply in Siesta the 
same savings as it does in PW schemes. Also, since the 
non-local part of the pseudopotential is a relatively cheap 
operator within Siesta, we generally (but not necessar- 
ily) use a larger than usual value of l^ax m Eq- @j mak- 
ing it one unit larger than the l m ax of the basis functions. 



III. BASIS SET 

Order- AT methods rely heavily on the sparsity of the 
Hamiltonian and overlap matrices. This sparsity re- 
quires either the neglect of matrix elements that are small 
enough or the use of strictly confined basis orJpitals, i.e., 
orbitals that are zero beyond a certain radiusQ. We have 
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FIG. 1: Local pseudopotential for silicon. Vi oca i is the un- 
screened local part of the pseudopotential, generated as the 
electrostatic potential produced by a localized distribution of 
positive charge, Eq. (Q), whose integral is equal to the valence 
ion charge (Z = 4 for Si). The dashed line is —Z/r. Vna is 
the local pseudopotential screened by an electron charge dis- 
tribution, generated by filling the first-£ basis orbitals with 
the free-atom valence occupations. Since these basis orbitals 
are strictly confined to a radius 4„„ Vna is also strictly zero 
beyond that radius. 



adopted this latter approach because it keeps the en- 
ergy strictly variational, thus facilitating the test of the 
convergence with respect to the radius of confinement. 
Within this radius, our atomic basis orbitals are products 
of a numerical radial function times a spherical harmonic. 
For atom /, located at Rj, 



<?Wn(r) = (t>iin(ri)Yi m (rz) 



(8) 



where 17 = r — Rf, r — |r| and f = r/r. The angu- 
lar momentum (labelled by I, m) may be arbitrarily large 
and, in general, there will be several orbitals (labelled by 
index n) with the same angular dependence, but differ- 
ent radial dependence, which is conventionally called a 
'multiple-^' basis. The radial functions are defined by a 
cubic spline interpolations^ from the values given on a 
fine radial mesh. Each radial function may have a dif- 
ferent cutoff radius and, up to that radius, its shape is 
completely free and can be introduced by the user in an 
input file. In practice, it is also convenient to have an au- 
tomatic procedure to generate sufficiently good basis sets. 
We have developed several such automatic procedures, 
and we will describe here one of them for completeness, 
even though we stress that the generation of the basis 
set, like that of the pseudopotential is to a large extent 



up to the user and independent of the Siesta method 
itself. 

In the case of a minimal (single-^) basis set, we have 
found corufpHient and efficient the method of Sankey and 
NiklcwskiLfEa Their basis orbitals are the cigenfunctions 
of the (pseudo)atom within a spherical box (although the 
radius of the box may be different for each orbital, see be- 
low). In other words, they are the (angular-momcntum- 
dependent) numerical eigenfunctions 4>i(r) of the atomic 
pseudopotential V/(r), for an energy e; + Sei chosen so 
that the first node occurs at the desired cutoff radius rf: 



I d 2 1(1 
2r"dr~ 2 ~ r+ 2r 2 



1) 



Vi{r) 



4>i{r) = (ej + Sei)4>i(r) 
(9) 



with 4>i(rf) — (we omit indices I and n here for simplic- 
ity) . In order to obtain a well balanced basis, in which the 
effect of the confinement is similar for all the orbitals, it 
is usually better to fix a common 'energy shift' <5e, rather 
than a common radius r c , for all the atoms and angular 
momenta. This means that the orbital radii depend on 
the atomic species and angular momentum. 

One obvious possibility for multiple-^ bases is to use 
pseudopotential eigenfunctions with an increasing num- 
ber of nodest2l. They have the virtue of being orthogonal 
and asymptotically complete. However, the efficiency of 
this kind of basis depends on the radii of confinement 
of the different orbitals, since the excited states of the 
pseudopotential are usually unbound. Thus, in practice 
we have found this procedure rather inefficient. Another 
possibility is to ase the atomic eigenstates for different 
ionization statest^. We have implemented a different 
schemetJ, based on the 'split- valence' method which is 
standard in quantum chemistryEJ. In that method, the 
first-C basis orbitals are 'contracted' (i.e. fixed) linear 
combinations of Gaussians, determined either variation- 
ally or by fitting numerical atomic eigenfunctions. The 
second-C orbital is then one of the Gaussians (generally 
the slowest-decaying one) which is 'released' or 'split' 
from the contracted combination. Higher-^ orbitals are 
generated in a similar way by releasing more Gaussians. 
Our scheme adapts this split-valence method to our nu- 
merical orbitals. Following the same spirit, our second-^ 
functions <f>i( r ) have the same tail as the first-£ orbitals 
(f>l (r) but change to a simple polynomial behaviour in- 
side a 'split radius' rf: 



r l (ai — bir 2 ) if r < rf 



if r > rf 



(10) 



where a/ and bi are determined by imposing the conti- 
nuity of value and slope at rf. These orbitals therefore 
combine the decay of the atomic eigenfunctions with a 
smooth and featureless behaviour inside rf. We have 
found it convenient to set the radius rf by fixing the 
norm of 0, in r > rf . We have found empirically that a 
reasonable value for this 'split-norm' is ~ 0.15. Actually, 
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instead of (ff" thus defined, we use (fyf — ffi, which is 
zero beyond r ; s , to reduce the number of nonzero matrix 
elements, without any loss of variational freedom. 

To achieve well converged results, in addition to the 
atomic valence orbitals, it is generally necessary to also 
include polarization orbitals, to account for the defor- 
mation induced by bond formation. Again, using pseu- 
doatomic orbitals of higher angular momentum is fre- 
quently unsatisfactory, because they tend to be too ex- 
tended, or even unbound. Instead, consider a valence 
pseudoatomic orbital 0; m (r) = <^(r)Y/ m (r), such that 
there are no valence orbitals with angular momentum 
I + 1. To polarize it, we apply a small electric field £ in 
the z-direction. Using first-order perturbation theory 



(H - E)5(f> = -(5H - SE)(f>, 



(11) 



where SH = £z and SE = (<fi\6H\(f>) = because 5H is 
odd. Selection rules imply that the resulting perturbed 
orbital will only have components with I' = J±l, mf = m: 

8H4> lm (r) = (£rcos(9))(Mr)Yi m (r)) 

= £r<f>i{r)(ci-\Yi-\, m + ci + iY l+1 . m ) (12) 



and 



k m (r) = (r)y ; _i im (r) + (pi +1 (r)Yi+i, m (r). (13) 



Since in general there will already be orbitals with an- 
gular momentum I — 1 in the basis set, we select the 
/ + 1 component by substituting ( jl2| ) and (|l3| ) in (11), 
multiplying by Y^ +l m (r) and integrating over angular 
variables. Thus we obtain the equation 



rl^_ r + (Z + !)(* + 2) 
2r dr 2 2r 2 

+ Vi(r) - Ei]i Pl+1 {r) = -rUr) (14) 

where we have also eliminated the factors £ and 
which affect only the normalization of ifii+i- The po- 
larization orbitals are then added to the basis set: 
0/+i,m(r) = N(pi +1 (r)Yi +lim (i), where TV is a normal- 
ization constant. 

We have found that the previously described proce- 
dures generate reasonable minimal single-^ (SZ) basis 
sets, appropriate for semiquantitative simulations, and 
double-^ plus polarization (DZP) basis sets that yield 
high quality results for most of the systems studied. We 
thus refer to DZP as the 'standard' basis, because it usu- 
ally represents a good balance between well converged 
results and a reasonable computational cost. In some 
cases (typically alkali and some transition metals), semi- 
core states also need-to be included for good quality re- 
sults. More recently^, we have obtained extremely ef- 
ficient basis sets optimized variationally in molecules or 
solids. Figure || shows the performance of these atomic 
basis sets compared to plane waves, using the same pseu- 
dopotentials and geometries. It may be seen that the SZ 
bases are comparable to planewave cutoffs typically used 
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FIG. 2: Comparison of convergence of the total energy with 
respect to the sizes of a plane wave basis set and of the LCAO 
basis set used by Siesta. The curve shows the total energy 
per atom of silicon versus the cutoff of a plane wave basis, cal- 
culated with a program independent of Siesta, which uses the 
same pseudopotential. The arrows indicate the energies ob- 
tained with different LCAO basis sets, calculated with Siesta, 
and the plane wave cutoffs that yield the same energies. The 
numbers in parentheses indicate the basis sizes, i.e. the num- 
ber of atomic orbitals or plane waves of each basis set. SZ: 
single-^ (valence s and p orbitals); DZ: double-^; TZ: triple-^; 
DZP: double-^ valence orbitals plus single-^ polarization d or- 
bitals; TZP: triple-^ valence plus single-^ polarization; TZDP: 
triple-^ valence plus double-^ polarization; TZTP: triple-^ va- 
lence plus triple-^ polarization; TZTPF: same as TZTP plus 
extra single-^ polarization / orbitals. 



in Car-Parrinello molecular dynamics simulations, while 
DZP sets are comparable to the cutoffs used in geom- 
etry relaxations and energy comparisons. As expected, 
the LCAO is far more efficient, tipically by a factor of 
10 to 20, in terms of number of basis orbitals. This ef- 
ficiency must be balanced against the faster algorithms 
available for plane waves, and our main motivation for 
using an LCAO basis is its suitability for O(N) meth- 
ods. Still, we have generally found that, even without 
using the O(N) functional, Siesta is considerably faster 
than a plane wave calculation of similar quality. 

Figure || shows the convergence of the total energy 
curve of silicon, as a function of lattice parameter, for 
different basis sizes, and table | summarizes the same 
information numerically. It can be seen that the 'stan- 
dard' DZP basis offers already quite well converged re- 
sults, comparable to those used in practice in most plane 
wave calculations. 

Figure || shows the dependence of the lattice constant, 
bulk modulus, and cohesive energy of bulk silicon with 
the range of the basis orbitals. It shows that a cutoff 
radius of 3 A for both s and p orbitals yields already very 
well converged results, specially when using a 'standard' 
DZP basis. 
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FIG. 3: Total energy per atom versus lattice constant for 
bulk silicon, using different basis sets, noted as in Fig. | PW 
refers to a very well converged (50 Ry cutoff) plane wave 
calculation. The dotted line joins the minima of the different 
curves. 



TABLE I: Comparisons of the lattice constant a, bulk mod- 
ulus B, and cohesive energy E c for bulk Si, obtained with 
different basis sets. The basis notation is as in Fig. |^. PW 
refers to a 50 Ry-cutoff plane wave calculation. The LAPW 
results were taken from ref. |37| and the experimental values 
from ref. Bq. 



Basis 


a (A) B (GPa) E c (eV) 


SZ 


5.521 


88.7 


4.722 


DZ 


5.465 


96.0 


4.841 


TZ 


5.453 


98.4 


4.908 


SZP 


5.424 


97.8 


5.227 


DZP 


5.389 


96.6 


5.329 


TZP 


5.387 


97.5 


5.335 


TZDP 


5.389 


96.0 


5.340 


TZTP 


5.387 


96.0 


5.342 


TZTPF 


5.385 


95.4 


5.359 


PW 


5.384 


95.9 


5.369 


LAPW 


5.41 


96 


5.28 


Expt. 


5.43 


98.8 


4.63 



IV. ELECTRON HAMILTONIAN 

Within the non-local-pseudopotential approximation, 
the standard Kohn-Sham one-electron Hamiltonian may 
be written as 



H = T 



■E 



■ylocal 



(r) 



J2Vi KB + V H (r)+V xc (r) 
i 




(15) 



5.0 6.0 7.0 
cutoff radii (a.u.) 



FIG. 4: Dependence of the lattice constant, bulk modulus, 
and cohesive energy of bulk silicon with the cutoff radius of 
the basis orbitals. The s and p orbital radii have been made 
equal in this case, to simplify the plot. PW refers to a well 
converged plane wave calculation with the same pseudopoten- 
tial. 



where T = — ^V 2 is the kinetic energy operator, / is an 
atom index, V (r) and V xc (r) are the total Hartree and 
XC potentials, and Vj ocal (r) and Vj KB are the local and 
non-local (Kleinman-Bylander) parts of the pseudopo- 
tential of atom /. 

In order to eliminate the long range of vj ocal , we screen 
it with the potential Vj atom , created by an atomic electron 
density pj tom , constructed by populating the basis func- 
tions with appropriate valence atomic charges. Notice 
that, since the atomic basis orbitals are zero beyond the 
cutoff radius rj — max;(rj ; ), the screened 'neutral-atom' 
(NA) potential Vj NA = V\ ocal + Vj atom is also zero beyond 
this radiuaj (see Fig. ||). Now let Sp(r) be the differ- 
ence between the self-consistent electron density p(r) and 
the sum of atomic densities p atom = J2iPi tom ' anc ^ ^ 
SV H (r) be the electrostatic potential generated by Sp(r), 
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which integrates to zero and is usually much smaller than 
/o(r). Then the total Hamiltonian may be rewritten as 

H = f + Y^ Vi KB + E V i NA ( r ) + 5V H {v) + V xc {r) 
i i 

(16) 

The matrix elements of the first two terms involve only 
two-center integrals which are calculated in reciprocal 
space and tabulated as a function of interatomic distance. 
The remaining terms involve potentials which are calcu- 
lated on a three-dimensional real-space grid. We consider 
these two approaches in detail in the following sections. 



V. TWO-CENTER INTEGRALS 

The overlap matrix and the largest part of the Hamilter 
nian matrix elements are given by two-center integrals^. 
We calculate these integrals in Fourier space, as proposed 
by Sankey and NiklewskiQ, but we use some implemen- 
tation details explained in this section. Let us consider 
first overlap integrals of the form 



5(R) = = / ^*(r)V2(r-R)dr, 



(17) 



where the integral is over all space and ipi , tp2 may be ba- 
sis functions 4>i m m KB pseudopotential projectors ximn, 
or more complicated functions centered on the atoms. 
The function S(R) can be seen as a convolution: we take 
the Fourier transform 



^(k) = 



1 



(2tt) 3 / 2 



(18) 



where we use the same symbol ip for ip(r) and ip(k), as its 
meaning is clear from the different arguments. We also 
use the planewave expression of Dirac's delta function, 
j el (k'-k)r dr = (27r) 3 (J(k'-k), to find the usual result 
that the Fourier transform of a convolution in real space 
is a simple product in reciprocal space: 



5(R) = / ^(k)V> 2 (k)e- jkR dk 



(19) 



Let us assume now that the functions ip( r ) can be ex- 
panded exactly with a finite number of spherical har- 
monics: 



'max I 



= E E Tpi™(r)Yi m (r), 



(20) 



1=0 m=-l 



^lm{r)= sm6d6 d V Y* n (e,^j(r,9, V ). (21) 
Jo Jo 

This is clearly true for basis functions and KB projec- 
tors, which contain a single spherical harmonic, and also 
for functions like xip(r), which appear in dipole matrix 



elements. We now substitute in (fL8) the expansion of a 
plane wave in spherical harmonicsc 



00 / 



eikr = J2J2 ^ l Ji(kr)Y; m (k)Y lm (r), (22) 



2=0 m=-i 



to obtain 



Imax I 



^ = E E ^m(fc)r im (k), 



1=0 m=-l 



(23) 



^m(&) = y-H) l J r'drjiikr^imir). (24) 
Substituting now (|f) and (H) into (|l|) we obtain 

2Z mox I 

S(R) = E Si m (R)Y lm {R) (25) 

1=0 m=-l 

where 

s lm (R) = E E G * 

(R), (26) 



Gi imi ,i 2 m. 2 dm = / sin9d9 / dip 

Y l * imi (e, v )Y hm2 (e, v )Y l : n (e 7l p),(27) 



S hmuh m 2 ,i(R) = ^ h ~ l2 ~ l / k 2 dkji(kR) 

Jo 

x i- h r hhmi ( k ) il2 ^,i 2 m 2 (k), (28) 

Notice that i~ ll tpi(k),i l2 ip 2 (k), and ih-h-l are a jj rea ^ 
since Zi — ^2 — ^ is even for all I's for which G; imii z 2m2i z m ^ 
0. The Gaunt coefficients Gi lTnit i 2Tn2i i m can be-obtained 
by recursion from Clebsch-Gordan coefficientaZI. How- 
ever, we use real spherical harmonics for computational 
efficiency: 



Y lm (0,<p) = C lm Pr (cos 6)x 



sin(miy9) if m < 
cos(mip) if m > 



(29) 



where P{ n (z) are the associatecL-legendre polynomials 
and Cim normalization constants^. This does not affect 
the validity of any of previous equations, but it modifies 
the value of the Gaunt coefficients. Therefore, we find 
it is simpler and more general to calculate Gi imi ,i 2 m 2 ,im 
directly froaa Eq. (p7|). To do this, we use a Gaussian 
quadrature^ 

/ sin 9d9 \ dip -> 4tt— Vwj sin 6>i — V (30) 
Jo Jo 6 1=1 f jZi 
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with N v = 1 + 3l max ,N e = 1 + mt(3l max /2), and the 
points cos#j and weights Wi are calculated as described 
in ref. [n| This quadrature is exact in equation ( p7| ) for 
spherical harmonics Y/ m (real or complex) of I < i max , 
and it can be used also to find the expansion of ^>(r) in 
spherical harmonics (eq. (pl|)). 

The coefficients Gi imit i 27n2 ^ m are universal and they 
can be calculated and stored once and for all. The func- 
tions Si imit i 2Tn2 ,i(R) depend, of course, on the functions 
^1,2 ( r ) being integrated. For each pair of functions, they 
can be calculated and stored in a fine radial grid up 
to the maximum distance R m ax = ?"i + r 2 a ^ which ipi 
and ip2 overlap. Their value at an arbitrary distance R 
can then be obtained very accurately using a spline in- 
terpolation. 

Kinetic matrix elements T(R) = (ip*\ — ^V 2 \ip2) can 
be obtained in exactly the same way, except for an extra 
factor k 2 in Eq. (EF 



-h-l 



^kUkUkR) 



x i- h r Uimi ( k y l2 ^hm 2 (k)- (31) 

Since we frequently use basis orbitals with a kinki, we 
need rather fine radial grids to obtain accurate kinetic 
matrix elements, and we typically use grid cutoffs of more 
than 2000 Ry for this purpose. Once obtained, the fine 
grid does not penalize the execution time, because the 
interpolation effort is independent of the number of grid 
points. It also affects very marginally the storage require- 
ments, because of the one-dimensional character of the 
tables. However, even though it needs to be done only 
once, the calculation of the radial integrals (pi}) , (28), 
and @ is not negligible if performed unwisely. We have 
developed a special fast radial Fourier transform for this 
purpose, as explained in appendix [B|. 

Dipole matrix elements, such as (ipi\x\ip2) , can also 
be obtained easily by defining a new function Xi(r) = 
xipi(r), expanding it using (pl|), and computing (xi)^) 
as explained above (with the precaution of using l max + 1 
instead of l ma x)- 



VI. GRID INTEGRALS 

The matrix elements of the last three terms of Eq. ( |l6| ) 
involve potentials which are calculated on a real-space 
grid. The fineness of this grid is controlled by a 'grid cut- 
off' E cut : the maximum kinetic energy of the planewayes 
that can be represented in the grid without aliasings. 
The short-range screened pseudopotentials Vf fA (r) in 
( |l~6| ) are tabulated as a function of the distance to atoms I 
and easily interpolated at any desired grid point. The last 
two terms require the calculation of the electron density 
on the grid. Let ^(r) be the Hamiltonian eigenstates, 
expanded in the atomic basis set 



(32) 



where c^i = and </> M is the dual orbital of 4*^: 

(</vl</v) = &nv We use the compact index notation 
/i = {limn} for the basis orbitals, Eq. (JsJ) - The elec- 
tron density is then 



P(r) = ^nil^i(r) 



(33) 



where n, is the occupation of state ipi. If we substitute 
(|32"|) into (|33|) and define a density matrix 



Ppv = 



(34) 



where Ci U = c* i; the electron density can be rewritten as 



(35) 



We use the notation 0* for generality, despite our use 
of real basis orbitals in practice. Then, to calculate the 
density at a given grid point, we first find all the atomic 
basis orbitals, Eq. (||), at that point, interpolating the 
radial part from numerical tables, and then we use ( |35| ) 
to calculate the density. Notice that only a small num- 
ber of basis orbitals are non-zero at a given grid point, 
so that the calculation of the density can be performed 
in 0(N) operations, once is known. The storage of 
the orbital values at the grid points can be one of the 
most expensive parts of the program in terms of memory 
usage. Hence, an option is included to calculate and use 
these terms on the fly, in the the spirit of a direct-SCF 
calculation. The calculation of itself with Eq. |34|) 
does not scale linearly with the system size, requiring in- 
stead the use of special 0{N) techniques to be described 
below. However, notice that in order to calculate the 
density, only the matrix elements p^ v for which M and 
4> v overlap are required, and they can therefore be stored 
as a sparse matrix of 0(N) size. Once the valence den- 
sity is available in the grid, ^we add to it, if necessary, 
the non-local core correctionEa, a spherical charge den- 
sity intended to simulate the atomic cores, which is also 
interpolated from a radial grid. With it, we find the 
exchange and correlation potential V xc (r), trivially in 
the LDA and using the method described in ref. for 
the GGA. To calculate 5V H (r), we first find p °^{ Y ) 
at the grid points, as a sum of spherical atomic densi- 
ties (also interpolated from a radial grid) and subtract it 
from p(r) to find Sp(r). We then solve Poisson's equa- 
tion to obtain 5V H (r) and find the total grid potential 
V(r) = V NA (r) + 6V H (r) + V xc (r). Finally, at every 
grid point, we calculate ^(r)^* (r)0„(r)Ar 3 for all pairs 

§iit 4>v which are not zero at that point (Ar 3 is the vol- 
ume per grid point) and add it to the Hamiltonian matrix 
element H^ v . 

To solve Poisson's equation and find SV H (r) we nor- 
mally use fast Fourier transforms in a unit cell that is 
either naturally periodic or made artificially periodic by 
a supercell construction. For neutral isolated molecules, 
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our use of strictly confined basis orbitals makes it triv- 
ial to avoid any direct overlap between the repeated 
molecules, and the electric multipole interactions de- 
crease rapidly with cell size. For charged molecules we 
supress the G = Fourier component (an infinite con- 
stant) of the potential created by the excess of charge. 
This amounts to compensating this excess with a uni- 
form charge background. We then use the method of 
Makov and Payndl3 to correct the total energy for the 
interaction between the repeated cells. Alternatively, we 
can solve Poisson's equation by the multigrid method, 
using finite differences and fixed boundary conditions, 
obtained from the multipole expansion of the molecular 
charge density. This can be done in strictly O(N) opera- 
tions, unlike the FFT's, which scale as N log N. However, 
the cost of this operation is typically negligible and there- 
fore has no influence on the overall scaling properties of 
the calculation. 

Figures || and ^ show the convergence of different mag- 
nitudes with respect to the energy cutoff of the integra- 
tion grid. For orthogonal unit cell vectors this is simply, 
in atomic units, E cu t = (ir/Ax) 2 /2 with Ax the grid 
interval. 





100 300 

e c ( R v) 



FIG. 6: Same as Fig. g for the total energy and pressure 
of bulk iron. This is presented as a specially difficult case 
because of the very hard partial core correction (r m = 0.7 au) 
required for a correct description of exchange and correlation. 



repeated twice in an almost independent way: only to 
calculate V xc (r) need they- he, combined. However, in 
the non-collinear spin caseOEaocH, the density at every 
point is not represented by the up and down values, but 
also by a vector giving the spin direction. Equivalcntly, 
it may be represented by a local spin density matrix 

p^(r) = ]>>^f (r)C(r) = 5>##( r M*to ( 36 ) 

i \LV 



10 30 50 70 90 

E c (Ry) 



20 60 100 140 180 

e c (Ry) 



j8 



(37) 



(38) 



FIG. 5: (a) Convergence of the total energy and pressure in 
bulk silicon as a function of the energy cutoff E cu t of the real 
space integration mesh. Circles and continuous line: using a 
grid-cell-sampling of eight refinement points per original grid 
point. The refinement points are used only in the final calcula- 
tion, not during the self-consistency iteration (see text). Tri- 
angles: two refinement points per original grid point. White 
circles: no grid-cell-sampling, (b) Bond length and angle of 
the water molecule as a function of E cu t 



where a, (3 are spin indices, with up or down values. The 
coefficients are obtained by solving the generalized 
eigenvalue problem 



E 



{H a J - EiS MV S^)4 



(39) 



where Hffl, like p%$, is a (2N x 2N) matrix, with N the 
number of basis functions: 



H af3 



,|T + V KB + V NA (r) + 5V H (r) + Vf c (r)\<j>„). 

(40) 



VII. NON-COLLINEAR SPIN xhis is in contrast to the collinear spin case, in which 

the Hamiltonian and density matrices can be factorized 
In the usual case of a normal (collinear) spin into two N x N matrices, one for each spin direction. To 
polarized system, there are two sets of values for calculate V^{r) we first diagonalize the 2x2 matrix 
V'i(r), P/ii/j p( r )> V xc (r), and H^, one for spin up and an- p a ^(v) at every point, in order to find the up and down 
other for spin down. Thus, the grid calculations can be spin densities (r), (r) in the direction of the local spin 
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vector. We then find V]r C (r),V^- c (r) in that direction, 
with the usual local spin density functionaLJ and we ro- 
tate back Vxc( r ) to the original direction. Thus, the 
grid operations are still basically the same, except that 
they need now be repeated three times, for the Tt> 44 
and |4 components. Notice that p af3 (r) and V^(r) are 
locally Hermitian, while and p^£ are globally Her- 
mitian (H^" — H"{**), so that their |f components can 
be obtained from the |4 ones. 



VIII. BRILLOUIN ZONE SAMPLING 

Integration of all magnitudes over the Brillouin zone 
(BZ) is essential for small and moderately large unit cells, 
especially of metals. Although Siesta is designed for 
large unit cells, in practice it is very useful, especially 
for comparisons and checks, to be able to also perform 
calculations efficiently on smaller systems without using 
expensive superlattices. On the other hand, an efficient 
/c-sampling implementation should not penalize, because 
of the required complex arithmetic, the T-point calcula- 
tions used in large cells. A solution used in some pro- 
grams is to have two different versions of all or part of 
the code, but this poses extra maintenance requirements. 
We have dealt with this problem in the following way: 
around the unit cell (and comprising itself) we define an 
auxiliary supercell large enough to contain all the atoms 
whose basis orbitals are non-zero at any of the grid points 
of the unit cell, or which overlap with any of the basis 
orbitals in it. We calculate all the non-zero two-center 
integrals between the unit cell basis orbitals and the su- 
percell orbitals, without any complex phase factors. We 
also calculate the grid integrals between all the supercell 
basis orbitals (j)^ and (j) v n (primed indices run over all 
the supercell), but within the unit cell only. We accumu- 
late these integrals in the corresponding matrix elements, 
thus making use of the relation 

< <^mr)|4v >= E < <P^\V(r)f(T)\^, > . 

(li'u") = (fiv') 

(41) 

f(r) = 1 for r within the unit cell and is zero other- 
wise. </> M is within the unit cell. The notation pi = fx 
indicates that <^,< and 4>^ are equivalent orbitals, re- 
lated by a lattice vector translation, {p'v") = {pv 1 ) 
means that the sum extends over all pairs of supercell 



orbitals c/y and 



such that p! — p, v" = v' , and 



— Ri,' = R M ' — R„". Once all the real overlap and 
Hamiltonian matrix elements are calculated, we multiply 
them, at every fc-point by the corresponding phase fac- 
tors and accumulate them by folding the supercell orbital 
to its unit-cell counterpart. Thus 



where <j)^ and <p v are within the unit cell. The resulting 
N x N complex eigenvalue problem, with TV the num- 
ber of orbitals in the unit cell, is then solved at every 
sampled k point, finding the Bloch-state expansion coef- 
ficients c M i(k): 



^(k,r)=E elkRM '^'( r )V^k) 



(43) 



where the sum in p! extends to all basis orbitals in space, 
i labels the different bands, c^n = c^i if p! = p, and 
"0i(k, r) is normalized in the unit cell. 
The electron density is then 

p(r) = J2 I MkM(k,r)| 2 dk 

■ JBZ 



pj v' 



(44) 



where the sum is again over all basis orbitals in space, 
and the density matrix 



V / c^(k)n 4 (k) Cw (k)e i;k ( R "- R ^dk (45) 

„ J BZ 



is real (for real </> M 's) and periodic, i.e. p^ = p^> v > if 
(v, /i) = (v',p') (with '=' meaning again 'equivalent by 
translation'). Thus, to calculate the density at a grid 
point of the unit cell, we simply find the sum (44) over 
all the pairs of orbitals <pp. > 4>v in the supercell that are 
non-zero at that point. 

In practice, the integral in ( [45| ) is performed in a finite, 
uniform grid of the Brillouin zone. The fineness of this 
grid is controlled by a fc-grid cutoff l cu t , a real-space ra- 
dius which plays a role-equivalent to the planewave cutoff 
of the real-space gridcS. The origin of the fc-grid may be 
displaced from k = Ojji order to decrease the number of 
inequivalent fc-pointsc3. 

If the unit cell is large enough to allow a T-point-only 
calculation, the multiplication by phase factors is skipped 
and a single real-matrix eigenvalue problem is solved (in 
this case, the real matrix elements are accumulated di- 
rectly in the first stage, if multiple overlaps occur). In 
this way, no complex arithmetic penalty occurs, and the 
differences between T-point and fc-sampling are limited to 
a very small section of the code, while all the two-center 
and grid integrals use always the same real-arithmetic 
code. 



IX. TOTAL ENERGY 

The Kohn-ShamB total energy can be written as a 
sum of a band-structure (BS) energy plus some correction 
terms, sometimes called 'double count' corrections. The 
BS term is the sum of the energies of the occupied states 



4>i 



E 



H„ 



(42) 



E BS = E nii^H^i) = E H^pvp = Tr(Hp) (46) 
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where spin and /c-sampling notations are omitted here for 
simplicity. At convergence, the ip^s are simply the eigen- 
vectors of the Hamiltonian, but it is important to realize 
that the Kohn-Sham functional is also perfectly well de- 
fined outside this so-called 'Born-Oppcnhcimcr surface', 
i.e. it is defined for any set of orthonormal tpi' s - The 
correction terms are simple functionals of the electron 
density, which can be obtained from equation (^5|), and 
the atomic positions. The Kohn-Sham total energy can 
then be written as 



E KS = 



E^WV-^J / V H {v)p{v)<fv 



- I (e xc (r) - V xc (r))p(r)d 3 r 
ZiZj 



E 

KJ 



Ru 



(47) 



where /, J are atomic indices, Ru = |Rj — Rj|, Zi, Zj 
are the valence ion pseudoatom charges, and e xc (r)p(r) 
is the exchange-correlation energy density. In order to 
avoid the long range interactions of the last term, we 
construct from the local-pseudopotential V}° , which 
has an asymptotic behavior of —Zi/r, a diffuse ion 
charge, pj ooa '(r), whose electrostatic potential is equal 



to V} ocal 



(r): 



(48) 



Notice that we define the electron density as positive, 
and therefore p l ° cal < 0. Then, we write the last term in 



as 



E §^ = \ E u\T\Ru) + E w\T\Ru) 



Ru 2 

KJ J IJ 



KJ 



E^° 



(49) 



where Ujf is the electrostatic interaction between the 
diffuse ion charges in atoms / and J: 

Uj f al (\R\) = J V} ocal (r)p l J cal (r - R)d 3 r, (50) 

gjjiocai - 1S a sma u short-range interaction term to correct 
for a possible overlap between the soft ion charges, which 
appears when the core densities are very extended: 



5u local 



(R) 



ZjZj 
R 



Ufj (R) , 



(51) 



and u\ ocal is the fictitious self interaction of an ion charge 
(notice that the first right-hand sum in (|i^) includes the 
I = J terms): 



Uj cal = i[/j5 ca/ (o) = - / r/'" ,, "iy)///"'"'[/-'i47r/--j/-. 



(52) 



Defining pf A from Vj NA , analogously to p l [ c °\ we have 
that pf A = p l j ocal + pf om , and equation @ can be 
transformed, after some rearrangement of terms, into 

e ks = E( T - + ^ s w + ^E^/(^) 

tiv I J 

+ J2 6u iT l (Ru)~J2 u i ocal 

KJ I 

+ J V NA {v)5p{v)d 3 v (53) 
+ \\ SvB (r)8p(r)(f t T + J e xc (r)p(i-)d 3 r 
where V NA = J2i Vf A and Sp = p-J2i Pj A - 
U? A {R) = J ^(rV^r-R^r 

= __L J V I NA (r)V 2 Vj NA {r-R,)d 3 r(5<i) 

is a radial pairwise potential that can be obtained from 
yNA^ ag a two-center integral, by the same method 
described previously for the kinetic matrix elements: 



T, 



-i|^(r)V 2 ^(r-R^)d 3 r (55) 



V^ V B is also obtained by two-center integrals: 



(56) 



where the sum is over all the KB projectors \a that over- 
lap simultaneously with 4>n and <j> v . 

Although ([53|) is the total energy equation actually 
used by Siesta, its meaning may be further clarified if 
the I — J terms of \ J2ij ^ij r (-Rjj) a re combined with 
J2i U\ ocal to yield 



e ks = e^ + ^w + e^/^jj) 

ytv KJ 

+ E &u¥f*{Rij) + E u i om 

KJ I 

+ J V NA (r)Sp(r)d 3 r (57) 
+ \j 6vH ( r ) 6 P( r ) d3r + J e xc (r)p(r)d 3 r 



where 

jatora 



Uf 



Vi oca \r) + -Vf om (r) J pf om {r)Anr 2 dr 
n \ 2 / 

(58) 



is the electrostatic energy of an isolated atom. 
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The last three terms in Eq. ( p3[ ) are calculated using 
the real space grid. In addition to getting rid of all long- 
range potentials (except that implicit in 5V H (r)), the 
advantage of (|53| ) is that, apart from the relatively slowly- 
varying exchange-correlation energy density, the grid in- 
tegrals involve Sp(r), which is generally much smaller 
than p(r). Thus, the errors associated with the finite 
grid spacing are drastically reduced. Critically, the ki- 
netic energy matrix elements can be calculated almost 
exactly, without any grid integrations. 

It is frequently desirable to introduce a finite electronic 
temperature T and/or a fixed chemical potential p, cither 
because of true physical conditions or to accelerate the 
self-consistency iteration. Th.en.-the functional that must 
be minimized is the free energyE2l 

F(Ri, ^(r), m) = E KS (-R!, ipi(r),ni) - ^ m 

i 

-k B Tj2(n, logn, + (1 - m) log(l - m)). (59) 

i 

Minimization with respect to yields the usual Fermi- 
Dirac distribution n t = 1/(1 + e ^-^/ kBT ). 

X. HARRIS FUNCTIONAL 

We will mention here a special use o f the , Harris energy 
functional, that is generally defined astirEa 



r(r)r(r') d3rd3r , 



r - r' 



+ /( £ r c (r)-C(r))p m (r)d 3 r+^§^ (60) 
J ftj RlJ 

where H m is the KS Hamiltonian produced by a trial 
density p ln and ip° ut are its eigenvectors (which in gen- 
eral are different from those whose density is p m ). As 
in Eq. (46), the first term in (p0|) can be written as 



Tr(H m p° m ), and the rest are the so-called 'double count 



corrections'. An important advantage of eq. (|60|) is that 
it does not require p\ n to be obtained from a set of or- 
thogonal electron states ip™ , and in fact p m is frequently 
taken as a simple superposition of atomic densities. How- 
ever, we will assume here that the states ipj n are indeed 
known. In this case, the Kohn-Sham energy E [p m ], 
Eq. ([l?]), obeys exactly the same expression (^), except 
that ip° ut and n° ut must be replaced by tpl n and n- n . 
Thus, a simple subtraction gives 

E Harris r ini j^KS[ inl i \ ^ rjin ( out in \ 
[P \ = E IP \+ Z^ H uu {Pau ~ Pav) ■ (61) 



Generally the Harris functional is used nonself- 
consistently, with a trial density given by the sum of 
atomic densities. But here we want to comment on its 



usefulness to improve dramatically the estimate of the 
converged total energy, by taking p™ as the density ma- 
trix of the (n— l)'th self-consistency iteration and of 
the n'th iteration. In fact, E Hams frequently gives, after 
just two or three iterations, a better estimate than E KS 
after tens of iterations. Unfortunately, we have found 
that there is hardly any improvement in the convergence 
of the atomic forces thus estimated, and therefore the 
self-consistent Harris functional is less useful for geome- 
try relaxations or molecular dynamics. 



XI. ATOMIC FORCES 



Atomic forces and stresses are obtained by direct dif- 
ferentiation of (|53| ) with respect to atomic positions. 
They are obtained simultaneously with the total energy, 
mostly in the same places of the code, under the general 
paradigm "a piece of energy => a piece of force/stress" 
(except that some pieces are calculated only in the last 
self-consistency step). This ensures that all force contri- 
butions, including Pulay corrections, are automatically 
included. The force contribution from the first term in 
@ is 



<9R/ 



E( T ^ + ^ B w = 



//;/ 



dR 



-2W9 v KB dSau n 



dR^ ah/ 



(62) 



where a are KB projector indices, £ I indicates orbitals 
or KB projectors belonging to atom J, and we have con- 
sidered that 



OS, 



dRr 



8Rr 



dR, f 



(63) 



where R/ (i is the position of atom I u , to which orbital 
4> a belongs and R nu = R/„ - R/^ . 

Leaving aside for appendix ^| the terms contain- 
ing dp vu / dRi, the other derivatives can be obtained 
by straightforward differentiation of their expansion in 
spherical harmonics (Eq. (|25|)). However, instead of us- 
ing the spherical harmonics Y; m (r) themselves, it is con- 
venient to multiply them by r , in order to make them 
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analytic at the origin. Thus 



jV(R) 
dR 



im 



R) 



R l Y lm {R) 



R l 



(64) 



In fact, it is SjZ(R)/R l , rather than S^(-R), that is 
stored as a function of J? on a radial grid. Its deriva- 
tive, d(Sj^(R)/R l )/dR, is then obtained fr om the same 
cubic spline interpolation used for the value itself. The 
value and gradient of R l Yi m (R) are calculated analyt- 
ically from explicit formulae (up to I = 2) or recur- 
rence relational. Entirely analogous equations apply to 
dT^v/dRn 



The second and third terms in Eq. (53) are sim- 
ple interatomic pair potentials whose force contributions 
are calculated trivially from their radial spline inter- 
polations. The fourth term is a constant which does 
not depend on the atomic positions. Taking into ac- 
count that V NA (r) = XifV/ M (r-R;), and therefore 
dV NA {r)/dHi = -VV I NA (r - Rj), the force contribu- 
tion from the fifth term is 



d 



OR 



- f V NA (r)5p(r)d s r = 
i J 



(65) 



VV I NA (r)6p(r)d i r + / V NA {v) 



rNA f ^ d5 P( r l d 3 r 



flRi 

The sixth term is the electrostatic self-energy of the 
charge distribution Sp(r): 

J 5V H (r)5p(r)d s r = J SV H (r)^±d s r (66) 

In the last term, we take into account that d(pe xc ) / dp = 



dKr 



=(r)p(r)d 3 r = / V xc (r) 



ORr 



(67) 



Now, using Eq. (|35| ) and that, for v £ I, d<p u (r)/dRi = 
— V^j,, the change of the self-consistent and atomic den- 
sities are 



dp(r) 
<9R/ 



2Re EE^^WV0,(r) 

M i>ei 



(68) 



dp atom (r) 



-2ReEp?; m ^( r ) V ^(r) (69) 
net 



where we have taken into account that the density matrix 
of the separated atoms is diagonal. Thus, leaving still 



aside the terms with dp va /dRi, the last term in Eq. (|6E_ 
as well as those in ( |66| ) and y@7\), have the general form 



Re EE^ / nr)^(r)V<Mr)d 3 r 

= Rc EE^^i^«iv^). 



(70) 



These integrals are calculated on the grid, in the same 
way as those for the total energy (i. e. (0 M |V(r) . 
The gradients V^v(r) at the grid points are obtained 
analytically, like those of <j> v (r) from their radial grid in- 
terpolations of <p(r)/r l : 



Y7A n d ( <l>nn(r)\ w 
V0 7im „(r) = — I -^^) r Y lm {v)v 



<t>iin(r) 



V(r l Y lm (r)). 



(71) 



In some special cases, with elements that require hard 
partial core corrections or explicit inclusion of the semi- 
core, the grid integrals may pose a problem for geometry 
relaxations, because they make the energy dependent on 
the position of the atoms relative to the grid. This 'egg- 
box effect' is small for the energy itself, and it decreases 
fast with the grid spacing. But the effect is larger and 
the convergence slower for the forces, as they are pro- 
portional to the amplitude of the energy oscillation, but 
inversely proportional to its period. These force oscilla- 
tions complicate the force landscape, especially when the 
true atomic forces become small, making the convergence 
of the geometry optimization more difficult. Of course, 
the problem can be avoided by decreasing the grid spac- 
ing but this has an additional cost in computer time and 
memory. Therefore, we have found it useful to minimize 
this problem by recalculating the forces, at a set of po- 
sitions, determined by translating the whole system by 
a set of points in a finer mesh. This procedure, which 
we call 'grid-cell sampling', has no extra cost in memory. 
And since it is done only at the end of the self-consistency 
iteration, for fixed p^, it has only a moderate cost in 
CPU time. 

At finite temperature, the forces are really the deriva- 
tives of the free energy with respect to atomic displace- 
ments since 



dF(R/,^(r),n;) OF 



dR! 



E 



dF 



dRi 



EOF diii 
drudR^ 

i 

dE 



dV?(r) dRi 



d y = 



dRi 



(72) 



In this particular equation we have used the notation 
d/dRi, as opposed to d/dRi, to indicate the inclusion 
of the change in tpi(r) and rii when we move the atom, 
in calculating the derivative. But we have used also that 
dF/drti = dF/ dipi (r) = and that the last two terms in 
(p9|) do not depend on R/, so that DF/dRi = dE/dRi. 
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The latter are the atomic forces actually calculated. No- 
tice, however, that dE/dHj ^ dE/dRj, so that the cal- 
culated forces are indeed the total derivatives of the free, 
not the internal energy. 

We would like to also mention the calculation of forces 
using the non-self-consistent Harris functional, in which 
the 'in' density is a superposition of atomic densities. We 
have implemented this as an option for 'quick and dirty' 
calculations because, used with a minimal basis set, it 
makes Siesta competitive with tight binding methods, 
which are much faster than density functional calcula- 
tions. The problem that we address here is that, al- 
though E Harns is stationary with respect to p out , it is 



not so with respect to p l 
a force term 



In particular, there appears 



8Rj 



p out (r)d 3 r. 



(73) 



A similar term appears for the electrostatic interaction 
between the input and output density, but it presents 
no special problems because of the linear character of 
the Hartree potential. However, evaluation of (|73| ) re- 
quires the change of the exchange-correlation potential 
with density, a quantity also required to evaluate the lin- 
ear response of the electron gas, but not in normal energy 
and force calculations. Finally, notice that, apart from 
this minor difficulty, the Harris-functional forces are per- 
fectly well defined at the first iteration only. For later 
iterations (but still not converged) there is no practi- 
cal way to calculate <9p m /dRi and, without the help of 
Hellman-Feynmann's theorem (which applies only at con- 
vergence), the forces are not well defined. Of course, the 
omission of the terms depending on this quantity pro- 
duces an estimate of the forces, but we have found that 
their convergence is not appreciably faster than those es- 
timated from the Kohn-Sham functional. 



XII. STRESS TENSOR 

Wc define the stress tensor as the positive derivative 
of the total energy with respect to the strain tensor 



0~af3 



dE KS 
de, 



a/3 



(74) 



where a, (3 are Cartesian coordinate indices. To trans- 
late to standard units of pressure, we must simply divide 
by the unit-cell volume and change sign. During the de- 
formation, all vector positions, including those of atoms 
and grid points (and of course lattice vectors), change 
according to 



origin gets displaced according to (f75|). From this equa- 
tion, we find that 



dr-, 



de 



a/3 



= S ia r/3 



(76) 



The change in E KS is essentially due to these position 
displacements, and therefore the calculation of the stress 
is almost perfectly parallel to that of the atomic forces, 
thus being performed in the same sections of the code. 
For example: 



de, 



a/3 



E 



8T <9r 7 
\ dr% v de a/3 <9r«„ 



(JLIS 



(77) 



Since dT^/dr^ is evaluated to calculate the forces, it 
takes very little extra effort to multiply it also by r@ 



for the stress. Equally, force contributions like (|7i 
their obvious stress counterpart 



have 



(78) 



[At/ 



However, there are three exceptions to this parallelism. 
The first concerns the change of the volume per grid point 
or, in other words, the Jacobian of the transformation 
( |75| ) in the integrals over the unit cell. This Jacobian is 
simply 8 a f}, and it leads to a stress contribution 



J \V NA {r) + ^SV H (r)^j Sp(r)d 3 r + E x 



S a p (79) 



Notice that the the renormalization of the density, re- 
quired to conserve the charge when the volume changes, 
enters through the orthonormality constraints, to be dis- 
cussed in appendix |a[ The second special contribu- 
tion to the stress lies in the fact that, as we deform 
the lattice, there is a change in the factor 1/ |r — r'| of 
the electrostatic energy integrals. We deal with this 
contribution in reciprocal space, when we calculate the 
Hartree potential by FFTs, by evaluating the derivative 
of the reciprocal-space vectors with respect to e a p. Since 



_d 1_ 



2G a G[ z 
G 4 



(80) 



Finally, the third special stress contribution arises in 
GGA exchange and correlation, from the change of the 
gradient of the deformed density p(r) — > p(r'). The treat- 
ment of this contribution is explained in detail in refer- 
ence fO. 



0=1 



(<W + e a ff) rp 



(75) 



Electric Polarization 



The shape of the basis functions, KB projectors, and 
atomic densities and potentials do not change, but their 



The calculation of the electric polarization, as an in- 
tegral in the grid across the unit cell, is standard and 



14 



almost free for molecules, chains and slabs (in the direc- 
tions perpendicular to the chain axis, or to the surface). 
For bulk systems, the electric polarization cannot be 
found from the charge distribution in the unit cell alone. 
In this case-JKe-jieed the so-called Berry-phase theory of 
polarizatiorSa, which-allows to compute quantitiepiike 
the dynamical chargesEJ and piezoeletric constants^ — . 
Here we comment some details of our implementations! . 

If R Q are the lattice vetors and P e = J2a=i f^Ra is 
the electronic contribution to the macroscopic polariza- 
tion, then we have 



'a ' P f 

2e 



dk G Q • ^7*(k,k') 



(81) 



k'=k 



where G a is the corresponding reciprocal lattice vector, 



e is the electron charge, «i(k, r) 



r ipi(k, r) is the 



periodic part of the Bloch function, and the factor of two 
comes from the spin degeneracy. The quantum phase 
$(k, k') is defined as 



$(k,k') = Im[ln(det(u i (k,r)|u j (k',r)))] 



(82) 



The derivative in mW) depends on a gauge that must 



be chosen such that u(k + G, r) 



-iGt 



r «(k, 



In 

practice, the integral is replaced by a discrete summa- 
tion, and a finite-difference approximation is made for the 



derivative^: Ak Q ^$(k,k'; 

$(k, k - Ak a )], where Ak Q : 
comes, for a = 1 



±[$(k,k+Ak Q )- 



G a /N a . Then (|SlJ) be- 



Gi • P e 



2c: 



Nn-l,N 3 —l Ni-1 



Yl E *(k llJ2J3 ,k ll+ i l 



22—0,23—0 i\— 



(83) 



where we have split the sum to stress the fact that we 
have a two-dimensional integral in the plane defined by 
G2 and G3, and a linear integral along Gi. Due to the 
approximation in the derivative, the linear integral usu- 
ally requires a finer mesh than the surface integral. To 
evaluate $(k, k + Ak) we use our LCAO basis: 

(«i(k)| Ui (k + Ak)) = (^(k)|e- lAk - r |^-(k + Ak)> = 
E.E^k^k+Ak) (84) 



-ik-(R„-R„ 



-iAk.(r-R„,) 



' J l</y 



Formulas si mi l ar to ( p4f) have been implemented by sev- 
eral authorscSH, mainly in the context of Hatree-Fock 
calculations, in which the basis orbitals are expanded 
in gaussians whose matrix elements can be found ana- 
lyticallyo. Our numerical, localized pseudo-atomic ba- 
sis orbitals are not well suited for a gaussian expan- 
sion. Instead, we expand the plane-waves appearing 



m equation 
1 - iAk ■ (r 



to first order in Ak, e 



-iAk-(r-R„ 



R„ 



C(Afc 2 ), and then we calculate the 
matrix elements of the position operator as explained in 



section^. It is interesting to note that, since the dis- 
cretized formula ( jS3|) only holds to 0(Ak 2 ), the approxi- 
mation of the matrix elements in ( [84] ) does not introduce 
any further errors in the calculation of the polarization. 
In a symmetrized version, we approximate equation m4 



]T5> w (k) V3 (k + Ak) f 



_i(k+At)-(R„-R 



Ak 

-i— ■ «<M(r - R,)l^'> + (0.1 (r - *V)I<V» 1(35) 



XIII. ORDER-iV FUNCTIONAL 



The basic problem for solving the Kohn-Sham equa- 
tions in 0(N) operations is that the solutions (the Hamil- 
tonian eigenvectors) are extended over the whole system 
and overlap with each other. Just to check the orthogo- 
nality of iV trial solutions, by performing integrals over 
the whole system, involves ~ N 3 operations. Among the 
different methods proposed to solve this jjtrnhl^nJj'EI, we 
have chosen the localized-orbital approachDEJCa because 
of its superior efficiency for nonp<pthogonal basis sets. 
The initially proposed functionalclCj used a fixed num- 
ber of occupied states, equal to the number of electron 
pairs, and it was found to have numerous local minima in 
which the electron configuration was easily trapped. A 
revised functional formE^I which uses a larger number of 
states than electron pairs, with variable occupations, has 
been found empirically to avoid the local minima prob- 
lem. This is the functional that we use and recommend. 

Each of the localized, Wannier-like states, is con- 
strained to its own localization region. Each atom / is 
assigned a number of states equal to int(ZJ aZ /2+l) so 
that, if doubly occupied, they can contain at least one 
excess electron (they can also become empty during the 
minimization of the energy functional). These states are 
confined to a sphere of radius R c (common to all states) 
centered at R/. More precisely, the expansion (Eq. (|32])) 
of a state ipi centered at R/ may contain only basis or- 
bitals M centered on atoms J such that |R/,/| < R c - 
This implies that ^>i(r) may extend to a maximum range 
R c + r c nax , where r 1 c nax is the maximun range of the ba- 
sis orbitals. For covalent systems, a localization region 
centered on bonds rather than atoms is more efficientEHI 
(it leads to a lower energy for the same i? c ), but it is 
less suitable to a general algorithm, especially in case of 
ambiguous bonds. Therefore, we generally use the atom- 
centered localization regions. 

In the method of Kim, Mauri, and Galli (KMG)El, the 
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band-structure energy is rewritten as: 



E 



KMG 



ij 



(86) 



Where &. 



rjSn V , and we have assumed a non-magnetic solution with 
doubly occupied states. The 'double count' correction 
terms of Eq. ( |47| ) remain unchanged and the electron 
density is still defined by (|35|), but the density matrix is 
re-defined as 

= 4 ^ C ^i C iv ~ 2 X! C ^aS a f3Cl3 C ov (87) 
i ij a/3 

The parameter r\ in Eq. ( p^ ) plays the role of a chemi- 
cal potential, and must be chosen to lie within the band 
gap between the occupied and empty states. This may be 
tricky sometimes, since the electron bands can shift dur- 
ing the self-consistency process or when the atoms move. 
In general, the number of electrons will not be exactly 
the desired one, even if r\ is within the band gap, because 
the minimization of ( |8^ ) implies a trade-off in which the 
localized states become fractionally occupied. To avoid 
an infinite Hartree energy in periodic systems, we simply 
renormalize the density matrix so that the total electron 
charge Yluu ^nuPvn equals the required value. 

For a given potential, the functional (^6|) is minimized 
by the conjugate-gradients method, using its derivatives 
with respect to the expansion coefficients 



dE 



KMG 



dc 



if. I 



-2^^ ( S^CvjCjaSHapCfJi 
j uflv 



(88) 



The minimization proceeds without need to orthonor- 
malize the electron states tpi. Instead, the orthogonal- 
ity as well as the correct normalization (one below r\ 
and zero above it) result as a consequence of the min- 
imization of E KMG . This is because, in contrast to 
the KS functional. E KMG is designed to penalize any 
nonorthogonalityLl The KS ground state, with all the 
occupied ^'s orthonormal, is also the minimum of (86), 
at which E KMG — E KS . If the variational freedom is 
constrained by the localization of the ipi's, the orthogo- 
nality cannot be exact, and the resulting energy is slightly 
larger than for unconstrained wavefunctions. In insula- 
tors and semiconductors, the Wannier functions are ex- 
ponentially localizedEJ, and the energy excess due to their 



strict localization decreases rapidly as a function of the 
localization radius R c , as can be seen in Fig. 0. 
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FIG. 7: Convergence of the lattice constant, bulk modulus, 
and cohesive energy as a function of the localization radius 
R c of the Wannier-like electron states in silicon. We used a 
supercell of 512 atoms and a minimal basis set with a cutoff 
radius r c — 5 a.u. for both s and p orbitals. 

If the system is metallic, or if the chemical potential 
is not within the band gap (for example because of the 
presence of defects) , the KMG functional cannot be used 
in practice. In fact, although some. 0(N) methods can 
handle metallic systems in principleO, we are not aware of 
any practical calculations at a DFT level. In such cases 
we copy the Hamiltonian and overlap matrices to stan- 
dard expanded arrays and solve the generalized eigen- 
value problem by conventional order- TV 3 diagonalization 
tcchniquesE-3. However, even in this case, most of the op- 
erations, and particularly those to find the density and 
potential, and to set up the Hamiltonian, are still per- 
formed in O(N) operations. 

Irrespective of whether the O(N) functional or 
the standard diagonalization is used, an outer self- 
consistency iteration is required, in which the density 
matrix is updated using Pulay's Residual Metric Mini- 
mization by Direct Ipyeision of the Iterative Subspace 
(RMM-DIIS) methodE3'E3. Even when the code is strictly 
O(N), the CPU time may increase faster if the number 
of iterations required to achieve the solution increases 
with N. In fact, it is a common experience that the re- 
quired number of selfconsistency iterations increases with 
the size of the system. This is mainly because of the 
'charge sloshing' effect, in which small displacements of 
charge from one side of the system to another give rise 
to larger changes of the potential, as the size increases. 
Fortunately, the localized character of the Wannier-like 
wavefunctions used in the O(N) method help to solve 
also this problem, by limiting the charge sloshing. Ta- 
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TABLE II: Average number of selfconsistency (SCF) iter- 
ations (per molecular dynamics step) and average number 
of conjugate- gradient (CG) iterations (per SCF iteration) re- 
quired to minimize the O(N) functional, during a simulation 
of bulk silicon at ~ 300 K. We used the Verlet methodEZI at 
constant energy, with a time step of 1.5 fs, and a minimal 
basis set with a cutoff radius r c — 5 a.u. R c is the local- 
ization radius of the Wannier-like wavefunctions used in the 
O(N) functional (see text). N is the number of atoms in the 
system. 
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ble y presents the average number of iterations required 
to minimize the O(N) functional and the average number 
of selfconsistency iterations, during a molecular dynam- 
ics simulation of bulk silicon at room temperature. It can 
be seen that these numbers are quite small and that they 
increase very moderately with system size. As might be 
expected, the number of minimization iterations increase 
with the localization radius, i.e. with the number of de- 
grees of freedom (c M i coefficients) of the wavefunctions. 
But this increase is also rather moderate. 

Figure || shows the essentially perfect 0{N) behaviour 
of the overall CPU time and memory. This is not surpris- 
ing in view of the completely strict enforcement of O(N) 
algorithms everywhere in the code (except the marginal 
N log N factor in the FFT used to solve Poisson's equa- 
tion, which represents a very small fraction of CPU time 
even for 4000 atoms). 



XIV. OTHER FEATURES 

Here we will simply mention some of the possibilities 
and features of the Siesta implementation of DFT: 

• A general-purpose package!!, the flexible data for- 
mat (fdf), initially developped for the Siesta 
project, allows the introduction of all the data 
and precision parameters in a simple tag-oriented, 
order-independent format which accepts different 
physical units. The data can then be accessed from 
anywhere in the program, using simple subroutine 
calls in which a default value is specified for the 
case in which the data are not present. A simple 
call also allows the read pointer to be positioned 
in order to read complex data 'blocks' also marked 
with tags. 

• The systematic calculation of atomic forces and 
stress tensor allows the simultaneous relaxation of 
atomic coordinates and cell shape and size, using a 



FIG. 8: CPU time and memory for silicon supercells of 64, 
512, 1000, 4096, and 8000 atoms. Times are for one aver- 
age molecular dynamics step at 300 K. This includes 10 SCF 
steps, each with 10 conjugate gradient minimization steps of 
the O(N) energy functional. Memories are peak ones. Al- 
though the memory requierement for 8000 atoms was deter- 
mined accurately, the run could not be performed because of 
insufficient memory in the PC used. 

conjugate gradients minimization or several other 
minimization/annealing algorithms. 

• It is possible to perform a variety of molecular dy- 
namics simulations, at constant energy or temper- 
ature, and at constant volume or pressure, also in- 
cluding Pacpnello- Rahman dynamics with variable 
cell shapeELl The geometry relaxation may be re- 
stricted, to impose certain positions or coordinates, 
or more complex constraints. 

• The auxiliary program Vibra processes systemat- 
ically the atomic forces for sets of displaced atomic 
positions, and from them computes the Hessian ma- 
trix and the phonon spectrum. An interface to the 
Phonon programEJis also provided within Siesta. 

• A linear response program (Linres) to calculate 
phonon frequencies has also been developedEI. The 
code reads the SCF solution obtained by Siesta, 
and calculates the linear response to the atomic 
displacements, using first order perturbation the- 
ory. It then calculates the dynamical matrix, from 
which the phonon frequencies are obtained. 

• A number of auxiliary programs allows various rep- 
resentations of the total density, the total and lo- 
cal density of states, and the electrostatic or to- 
tal potentials. The representations include both 
two-dimensional cuts and three-dimensional views, 
which may be colored to simultaneously represent 
the density and potential. 
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• Thanks to an interface with the Transiesta pro- 
gram, it is posible to calculate transport properties 
across a nanocontact, finding self-consistently the 
effective potential across a finite voltage drop, at 
a DFT leaiel, using the Keldish Green's function 
formalismO. 

• The optical response can be studied with siesta us- 
ing different approaches. An approximate dielectric 
function can be calculated from the dipolar tran- 
sition matrix elements between occupied and un- 
occupied single-electron eigenstates usiag first or- 
der time-dependent pertubation theoryO For finite 
systems, these are easily calculated from the matrix 
elements of the position operator between the ba- 
sis orbitals. For infinite periodic systems, we use 
the matrix elements of the momentum operator. 
It is important to notice, however, that the use of 
non-local pseudopotentials requires some correction 
termalj. 

We have also implemented a more sophisticated 
approach to compute the optical response of fi- 
nite systems, using the,-a,diabatic approximation 
to time-dependent DFTlliEI The idea is to in- 
tegrate the time-dependent Schrodinger equation 
when aJime depedent perturbation is applied to the 
systemLll From the time evolution, it is then pos- 
sible to extract the optical adsorption and dipole 
strength functions, including some genuinely many- 
body effects, like plasmons. Using this approach we 
have succesfully calculated the electronic response 
of syste»s such as fullerenes and small metallic 
clustersO. 
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APPENDIX A: ORTHOGONALITY FORCE AND 
STRESS 

We have yet to comment on the force and stress 
terms containing <9p M „/9R/. Substituting the first 
term of Eq. (|8|) into Eqs. (]65|-|67|) and adding the 
first term of Eq. (^2|) we obtain a simple expression: 
X^i, H Vf j,dpfj, v /dHi. Now, p M „ is a function of the Hamil- 
tonian eigenvector coefficients and occupations only 
(Eq. (HH)). On the Born-Oppenheimer surface (BOS), 
E KS is stationary with respect to those coefficients and 
occupations, and the Hellman-Feynmann theorem guar- 
antees that any change of them will not modify the total 
energy to first order, and therefore will not affect the 
forces. In other words, the atomic forces are the partial 
derivatives dE KS /<9Rj at constant c^i and n*. Even in 
the Car-Parrinello scheme, in which the system moves 
out of the BOS, making the Hellman-Feynmann theo- 
rem invalid, the atomic forces are nevertheless defined as 
derivatives at constant c^i and ni. Thus, it may seem 
that the terms dp^/dRi are irrelevant for the calcula- 
tion of the forces. However, in the previous discussion 
we have omitted to say that the KS energy must be min- 
imized under the constraint of orthonormality of the oc- 
cupied states and that, therefore, at the BOS the energy 
is stationary only with respect to changes of ipi which do 
not violate the orthonormality. With an atomic basis set, 
the displacement of atoms (and the deformation of the 
unit cell) modifies the basis, and therefore the occupied 
states ipi = Ylu &pCpii even at constant c^j's. And the 
change of the states affects their orthonormality. Thus, 
in order to calculate the new total energy, we need to 
re-orthonormalize the occupied states, by changing their 
coefficients c^,-. Schematically, we must solve 

(ipi\8S\iPj) + (<W|Vi> + (ipilS^j) = (Al) 

where SS represents the change of due to the atomic 
displacements, and Stpi the modification of ipi due to the 
change of c K ; . Without lack of generality, we can expand 
Sipi in the basis of the eigenvectors ipi as Sipi — J2j ^j^ji- 
Substituting this expansion into ( ]Al[ ) and using that 
(ipi\S\ipj) — Sij we obtain Xji — — h(ipj\SS\ipi). Thus 

l#*> = -~El^il 55 l^ ( A2 ) 

3 

In terms of the coefficients c^i, we have (ipj\6S\ipi) = 
J2fj,i, Cj/iSSpvCi and 

0~C[_Li ^ ^ ] ^ ] CiijCjfjSSrji/Cjji 

j n <v 

= — — S^SSjjvCyi (A3) 
where we have used that c^i = ((p^ipi), and 

c ^ c ™ = = S ^ ( A4 ) 
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Differentiating now Eq. 



and using (A3) we obtain 



Sp^u 
And 



nC 



(A5) 



(A6) 



where E^ v is the so-called energy-density matrix 



fit 1 L l^l^lV 



(A7) 



where are the eigenstate energies. To calculate the or- 
thogonalization force or stress, SS^ must be substituted 
by the appropriate derivative: 



-\orthog 



(A8) 



This equation has_been derived before in different ways, 
and Ordejon et alE2l found it also for the O(N) functional, 
even though it does not require the occupied states to be 
orthogonal. In this case, Eq. (Alarmist be substituted 
by a more complicated expression^. 
Similarly, the stress contribution is 



orthog 

T a/3 — 



(A9) 



Notice that we have extended the integral to the whole 
real axis, defining r) = (—l) l ipi(r), in accordance to 
the behaviour ifji(r) ~ r l , r — > 0. The coefficients c\'^ can 
be obtained by defining a complex polynomial Pi(x\— j= 
Pf(x) + iPf(x), which obeys the recurrence relational 



P (x) = i = V-L 
Pi(x) = i — x 

P+i(x) = (21 + l)Pi(x) - x 2 Pi. 1 {x) 



(B3) 



In order to perform the integrals in (BS) using discrete 
FFT's, we need to calculate ip(r) on a regular radial grid, 
up to a maximum radius r max , beyond which ip(r) is as- 
sumed to be strictly zero. The separation Ar between 
grid points determines a cutoff k max — 7r/Ar in recip- 
rocal space, and vice versa, Afc = Ti/r max . For con- 
volutions, such as those involved in Eq. (p8[), we need 
r m ax = r{ + r\ and k max = max(fcj, fef), where rl >2 , k\ 2 
are the cutoff radii and maximum wavevectors of ^1,2, 
respectively. We must then pad with zeros the inter- 
vals [rl 2,r m ax] for the forward transforms ipi,2( r ) ~* 
ipi,20e). In practice, we set r max = 2 max M (r£), k max = 
max /i (fc^), where p labels all the basis orbitals and KB 
projectors, and we use the same real and reciprocal 
grids for all orbital pairs. In this way, we need to 
perform the forward transform only once for each ra- 
dial function ^(r). Finally, notice that in Eq. (|2q) , 
^h imi ( k )^,i 2m2 (k) ~ k h+l * for k -► 0, while h+l 2 -l is 
even and nonnegative, so that the integrands of Eq. (|B2| ) 
for the backward transform are all even and well behaved 
at the origin. 



APPENDIX B: RADIAL FAST FOURIER 
TRANSFORM 

We consider here how to perform fast integrals of the 
form 

/>oo 

Mk)= / r 2 drji{kr)Mr) (Bl) 
Jo 

where ji(kr) is a spherical Bessel function and ipi(r) 
is a radial function which behaves as ipi( r ) ~ r ' f° r 
r — > 0. Although methods to perform fast Bessel and 
Hankel transfoHiis. have been described previously in dif- 
ferent fieldalj'EJ'Eil we have developed a simple method 
adapted to our needs. It is based on the fact that ji(x) 
has the general form (Pf(x) sin(x) + Pf(x) cos(x)) / x l+1 , 
where P[ S (x), Pf(x) are simple polynomials Pf' c (x) = 

E n =o c tn xn - Thus, the methodJjivolves computing I + 1 
fast sine and cosine transforms^ and add the different 
terms: 

Mk) = E^r r^^)dr 

n=o J -°° 
+ t^r^^(kr) d r(B2) 

n=0 ZK J -°° r 



APPENDIX C: EXTENDED-MESH ALGORITHM 

We describe here a simple and efficient algorithm to 
handle mesh indices in three-dimensional periodic sys- 
tems. Its versatility makes it suitable for several differ- 
ent tasks in Siesta like neighbor-list constructions, ba- 
sis orbital evaluation in the real-space integration grid, 
density-gradient calculations in the GGA, etc. It would 
be also very apropriate for other problems, like the so- 
lution of partial differential equations by real-space dis- 
cretization or the calculation of the interaction energy 
in lattice models. For clarity of the exposition, we will 
describe the algorithm for a specially simple application, 
namely the evaluation of the Laplacian of a function f(r) 
using finite differences, even though the algorithm is not 
used in Siesta for this purpose. In three dimensions, 
one generally discretizes space in all three periodic direc- 
tions, using an index for each direction. For simplicity, 
let us consider an orthorhombic unit cell, with mesh steps 
A:r, Ay, Az. Then the simplest formula for the Laplacian 
is 

^ fixAyAz = (fix + l,i y Az ^fixtiyAz fix — l,iy ~) I ^ x 
+ (/»* ,*».**+! — 2/i x ,i v ,i l + fi„,i y ,ix-l)/Az 
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A direct translation of this expression into Fortran90 
code might read 

Lf(ix,iy,iz) = k 

( f (modulo (ix+1 ,nx) , iy, iz) + & 

f (modulo (ix-1 ,nx) , iy, iz) )/dx2 & 

+ ( f (ix, modulo (iy+1 ,ny) , iz) + & 

f (ix, modulo (iy-1 ,ny) , iz) )/dy2 & 

+ ( f (ix, iy, modulo (iz+1 ,nz) ) + & 

f (ix, iy, modulo (iz-1 ,nz) ) )/dz2 & 
- f(ix,iy,iz) * (2/dx2+2/dy2+2/dz2) 

where the indices i a (a = {x,y,z}) of the arrays f and 
Lf run from to n a — 1, as in C. There are two problems 
with this construction. First, the modulo operations are 
required to bring the indices back to the allowed range 
[0,n a — 1]. And second, the use of three indices to refer 
to a mesh point implies implicit arithmetic operations, 
generated by the compiler, to translate them into a single 
index giving its position in memory. 

A straightforward solution to these inefficiencies would 
be to create a neighbor-point list j_neighb(i,neighb), 
of the size of the number of mesh points times the num- 
ber of neighbor points. However, although the latter are 
only six in our simple example, they may frequently be 
as many as several hundred, what generally makes this 
approach unfeasible. A partial solution, addressing only 
the first problem, is to create six (or more for longer 
ranges) one-dimensional tables Ja 1 ^) = mod(i a ±l, n a ) 
to avoid the modulo computationsEJ. Here, we describe 
a multidimensional generalization of this method, which 
solves both problems at the expense of a very reasonable 
amount of extra storage. 

The method is based on an extended mesh, which ex- 
tends beyond the periodic unit cell, by as much as re- 
quired to cover all the space that can be reached from 
the unit cell by the range of the interactions or the 
finite-difference operator. The extended mesh range is 

■mm = _ Auq and { rnax = ^ _ 1 + where ^ = 1 

in our particular example, in which the Laplacian for- 
mula extends just to first-neighbor mesh points. In prin- 
ciple, in cases with a small unit cell and a long range, the 
mesh extension may be larger than the unit cell itself, 
extending over several neighbor cells. However, in the 
more relevant case of a large system, we will expect the 
extension region to be small compared to the unit cell. 
We then consider two combined indices, one associated 
to the normal unit-cell mesh, and another one associated 
to the extended mesh 



% % X "T Ti X 1y -\~ Tt X Tiy1 Z , 



/- -mm\ i ^ext r ■ ■min\ , 
text = [lx-l x )+ n x \ l y ~\ )+ > 



n-- = i™ ax - i™ n + 1 = n a + 2An Q . The key 
observation is that, if i ext is a mesh point within the 
unit cell (0 < i a < n a — 1), and if j ext is a neighbor 
mesh point (within its interaction range, i.e. \j a — i a \ < 
An a ), then the arithmetic difference jext — iext depends 



only on the relative positions of i ex t and j ex t (i.e. on 
ja ~ ia), but not on the position of i ext within the unit 
cell. We can then create a list of neighbor strides Aij ext , 
and two arrays to translate back and forth between i and 
iext- One of the arrays maps the unit cell points to the 
central region of the extended mesh, while the other one 
folds back the extended mesh points to their periodically 
equivalent points within the unit cell. Then, to access the 
neighbors of a point i, we a) translate i — > i ex t', b) find 
jext = iext + Aijext] and c) translate j ex t — > j- Notice 
that several points of the extended mesh will map to the 
same point within the unit cell and that, in principle, a 
unit cell point j may be neighbor of i through different 
values of jext- In our example, the innermost loop would 
then read 

Lf(i) = 

do neighb = l,n_neighb 

j_ext = i_extended(i) + ij_delta(neighb) 

j = i_cell(j_ext) 

Lf(i) = Lf(i) + L(neighb) * f(j) 
end do 

where the number of neighbor points would be 
n_neighb=7, including the central point itself, and 



ij 


_delta(l) = 


1 






L(l) = l/dx2 


ij 


_delta(2) = 


-1 






L(l) = l/dx2 


ij 


_delta(3) = 


nx. 


_ext 




L(3) = l/dy2 


ij 


_delta(4) = 


-nx. 


_ext 




L(4) = l/dy2 


ij 


_delta(5) = 


nx. 


_ext*ny_ 


.ext 


L(5) = l/dz2 


ij 


_delta(6) = 


-nx. 


_ext*ny_ 


.ext 


L(6) = l/dz2 


ij 


_delta(7) = 





; L(7) = 


=-2/dx2-2/dy2-2/dz2 



Notice that the above loop is completely general, for any 
linear operator, using an arbitrary number of neighbor 
points for its finite difference representation. In fact, it 
is even independent of the space dimensionality. Further- 
more, the index operations required are just one addition 
and three memory calls to arrays of range oneo. This in- 
ner loop simplicity comes at the expense of the two extra 
arrays i_extended and i_cell (of the size of the normal 
and extended meshes, respectively) which are generally 
an acceptable memory overhead. Notice, however, that 
the the neighbor-point list ij_delta is independent of 
the mesh index i, what makes this array quite small in 
most problems of interest. 



APPENDIX D: SPARSE MATRIX TECHNIQUES 

We will describe here some of the sparse-matrix mul- 
tiplication techniques used in evaluating Eqs. (|35|) 
and 



(p7|), and (88). There is a large variety of sparse-matrix 
representations and algorithms, each one optimized for a 
different kind of sparsity. The main constraint for choos- 
ing our representation and algorithms is that they must 
be O(N) in both memory and CPU time. We enforce 
this condition strictly by requiring, for example, that a 
vector of size ~ N will not be reset to zero a number 
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~ N of times. In our sparse matrices, like S^„, H^, c^i, 
p^ ui and <^>u(r), the number p of non-zero elements in a 
row is typically much larger than one (but still of order 
~ N°) and much smaller than the row size m ^ N 1 . 
Such matrix rows are efficiently stored as a real vector of 
size p, containing the non-zero elements, and an integer 
vector of the same size containing the column index of 
each non-zero element. The whole matrix A of n rows is 
then represented by two arrays A and jcol, of size n x p, 
such thaC3 A^i — A(i,k), where j — jcol(i,k). The 
problem with this representation is that, given a value 
j of the column index, there is no simple way to access 
the element Aij without scanning the whole row, what 
is frequently too costly. One solution is to unpack a row 
2, that will be repeatedly used, into 'expanded form', i.e. 
to transfer it to a vector Arow of the full row size m (con- 
taining also all the zeros), so that Ay = Arow(j). Since 
p >> 1, the size of Arow is negligible compared to that 
of A and jcol. 

To find the matrix product C of two sparse matrices 
A and B 

Cik = ^ ' AijBjk 

3 

we proceed iteratively for each row i of A (which will 
generate the same row of C): each non-zero element j 
of the row is multiplied by every non-zero element of the 
jth row of B (whose column index is, say k) and the 
result is accumulated in the fc'th position of an auxiliary 
'expanded' vector. After finishing with that row of A we 
pack the vector in sparse format into the ith row of C and 
restore the auxiliary vector to zero. In fact, the packing 
can be performed simultaneously with the product, using 
an auxiliary index vector instead: 

C = 0. 
ncolC = 
jcolC = 

pos = ! Auxiliary index vector 
do i = 1 , nA 

do jA = l,ncolA(i) 
j = jcolA(i,jA) 
do jB = l,ncolB(j) 
k = jcolB(j,jB) 
jC = pos(k) 

if (jC==0) then ! New non-zero col 

ncolC(i) = ncolC(i) + 1 

jC = ncolC(i) 

jcolC(i,jC) = k 

pos(k) = jC 
endif 

C(i,jC) = C(i,jC) + A(i,jA)*B(j,jB) 
enddo 
enddo 

do jA = l,ncolA(i) ! Restore pos to zero 
j = jcolA(i,jA) 
do jB = l,ncolB(j) 
k = jcolB(j,jB) 



pos(k) = 
enddo 
enddo 
enddo 

Notice that the auxiliary vector pos, which keeps the 
position in 'packed format' of the non-zero elements of 
one row of C, is initiallizcd in full only once. Notice 
also that this algorithm, unlike those of ref. |3l], does not 
require the matrix elements to be stored in ascending 
column order. 

The previous algorithm generates all the non-zero ele- 
ments of C but in many cases we need only some of them. 
For example, to calculate the electron density (Eq. ([35])), 
we need only the density matrix elements p uv for which 
cftfj, and 4> v overlap. Also the expression (pq) needs to 
be evaluated only for the coefficients c^i which are al- 
lowed to be non-zero by the localization constraints. In 
these cases, in which the array j colC is already known, 
another algorithm is more effective. We start by find- 
ing the sparse representation of B in column order or, in 
other words, the transpose of B: 

Bt = ! B transpose 
jcolBt = 
ncolBt = 
do i = l,nB 

do jB = l,ncolB(i) 

j = jcolB(i,jB) 

ncolBt(j) = ncolBt(j) + 1 

jBt = ncolBt(j) 

jcolBt(j , jBt) = i 

Bt(j.jBt) = B(i,jB) 
enddo 
enddo 

We then unpack a row i of A and multiply it by a column 
j of B (a row of its transpose) for each required matrix 
element Cy of their product: 

C = 0. 

Arow =0. ! Auxiliary vector 
do i = 1 , nC 

do jA = l,ncolA(i) ! Copy one row of A 
j = jcolA(i,jA) 
Arow(j) = A(i,jA) 
enddo 

do jC = l,ncolC(i) ! Calculate Cij 
j = jcolC(i,jC) 
do jBt = l,ncolBt(j) 
k = jcolBt(j , jBt) 

C(i,jC) = C(i,jC) + Arow(k)*Bt(j , jBt) 
enddo 
enddo 

do jA = l,ncolA(i) ! Restore Arow to zero 

j = jcolA(i,jA) 

Arow(j) = 0. 
enddo 
enddo 
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The combination of these two matrix multiplication al- 
gorithms allows an efficient evaluation of Eqs. (|86|-[S8|). 
Since these equations involve a trace or a relatively small 
subset of a final matrix, it is important to control the 
order and sparsity of the intermediate products, in order 
to keep them as sparse as possible. Notice that, once a 
row of A x B has been evaluated, it may be multiplied 
by a third matrix, to obtain a row of the final product, 
without need of storing the whole intermediate matrix. 

To calculate the density at a grid point using Eq. j35| ) 
we need to access the matrix elements and this is in- 
efficient if they are stored in sparse format. Thus, we first 
copy the matrix elements, between the n r basis orbitals 



which are non-zero at the grid point r, into an auxiliary 
matrix array, of size n aux x n aux , with n aux > n r . We 
also create a lookup table pos, of size equal to the to- 
tal number of basis orbitals, such that pos(mu) is the 
position, in the auxiliary matrix, of the matrix elements 
of orbital mu (or zero if they have not been copied to 
it). If there are new non-zero orbitals at the next grid 
points, we keep copying them into the auxiliary matrix, 
until all its n aux slots are full, at which point we erase it 
and restart the process. Since succesive grid points tend 
to contain the same non-zero basis orbitals, these copies 
and erasures are not frequent. 
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